GBScleanR: robust genotyping error correction using a hidden Markov model with error pattern recognition

您所在的位置:网站首页 concordance lines centrality GBScleanR: robust genotyping error correction using a hidden Markov model with error pattern recognition

GBScleanR: robust genotyping error correction using a hidden Markov model with error pattern recognition

2023-05-28 17:35| 来源: 网络整理| 查看: 265

Abstract

Reduced-representation sequencing (RRS) provides cost-effective and time-saving genotyping platforms. Despite the outstanding advantage of RRS in throughput, the obtained genotype data usually contain a large number of errors. Several error correction methods employing the hidden Markov model (HMM) have been developed to overcome these issues. These methods assume that markers have a uniform error rate with no bias in the allele read ratio. However, bias does occur because of uneven amplification of genomic fragments and read mismapping. In this paper, we introduce an error correction tool, GBScleanR, which enables robust and precise error correction for noisy RRS-based genotype data by incorporating marker-specific error rates into the HMM. The results indicate that GBScleanR improves the accuracy by more than 25 percentage points at maximum compared to the existing tools in simulation data sets and achieves the most reliable genotype estimation in real data even with error-prone markers.

reduced-representation sequencing, error correction, imputation, hidden Markov model Introduction

Reduced-representation sequencing (RRS) is a sequencing technique using next-generation sequencing (NGS) with reducing sequence targets by taking reads only from a limited portion of a genome (Scheben et al. 2017). To meet the demands for cost-effective genotyping systems with dense markers, many different RRS-based genotyping methods have been introduced, including restriction site-associated DNA sequencing (RAD-seq) and genotyping-by-sequencing (GBS) (Baird et al. 2008; Elshire et al. 2011). The reduction in the number of sequence targets allows us for highly multiplexed sequencing to obtain genotype data of a large population such as a hybrid population for genetic mapping and breeding. However, a larger number of samples per NGS run result in genotype data with a lower read coverage per sample (Poland and Rife 2012). Since sequence reads that can be retrieved via NGS are highly stochastic, the limited number of reads can result in the undercalling of heterozygotes, in which recovering only 1 allele at a heterozygous site leads to its incorrect identification as a homozygote. The lack of reads also generates a substantial number of missing genotype calls.

Several error correction tools have been developed to overcome these disadvantages. The method to fill up missing calls and correct the undercalling of heterozygotes was first published in 2009, which determines genotypes based on the allele read ratio within a sliding window (Huang et al. 2009). Genotype-Corrector is also an example of the tool using a sliding window method for error correction (Miao et al. 2018). Recent publications have introduced statistically sophisticated methods using hidden Markov models (HMMs), such as FSFHap, TIGER, LB-Impute, and magicImpute (Swarts et al. 2014; Rowan et al. 2015; Fragoso et al. 2016; Zheng et al. 2018). The existing methods, however, assume constant error rates for all markers. For example, those assume a 50:50 probability that a read could be obtained for 1 of 2 possible alleles at a heterozygous site. Although this assumption is true in an ideal situation, in practice, genotype data contain a significant number of error-prone markers that show skewed probabilities in allele read acquisition as a result of actual biological unevenness (Davey et al. 2013; Wijnker et al. 2013; DaCosta and Sorenson 2014). Variations in genome sequences change the fragmentation patterns of the genomes. Even if 2 independent reads are mapped at the same locus, the sequences of the genomic fragments from which each of the 2 reads originated can vary in terms of GC content and length and sometimes include large insertions or deletions in the unsequenced region of the fragments. Since the GC contents and the lengths of the restriction fragments are known to affect the amplification efficiency, the probability of observing a read for either allele may differ. These biases are likely to be more prominent in a polymorphism-rich population, for example, that derived from a cross between distant relatives. Hence, mismatches occur between the real data, which shows marker-specific error rates, and the models that assume a uniform error rate, resulting in biased genotype estimation and poor error correction accuracy.

Here, we introduce the R package “GBScleanR,” which implements an HMM-based error correction algorithm. Our algorithm estimates an allele read bias and mismapping rate for each marker and incorporates those into the HMM as parameters to capture the skewed probabilities in allele read acquisitions. This paper demonstrates a comparison of GBScleanR and 2 well-established error correction tools: LB-Impute and magicImpute (Fragoso et al. 2016; Zheng et al. 2018). While LB-Impute accepts only biparental populations derived from inbred founders, magicImpute can work on biparental and multiparental populations derived from both inbred and outbred founders. The magicImpute algorithm is also able to estimate founder genotypes simultaneously with the offspring genotypes. Similar to magicImpute, GBScleanR supports simultaneous estimation of founder and offspring genotypes in a biparental and multiparental population. We first show simulation studies to present the accuracy and robustness of GBScleanR using simulation data sets that have severe allele read biases at error-prone markers. The simulation assumes 3 scenarios: a biparental F2 population (homoP2_F2), an outbred F1 population (hetP2_F1), and 8-way recombinant inbred lines (homoP8_RIL). We then demonstrate the reliability and robustness of GBScleanR using real data derived from a cross between distant relatives of rice, which potentially contains many error-prone markers.

Methods Modeling for error correction

Our modeling basically follows the model implemented in magicImpute that has been introduced in Zheng et al. (2018). The error correction algorithm of GBScleanR employs the HMM and treats the observed allele read counts for each SNP marker along a chromosome as outputs from a sequence of latent true genotypes. Our model supposes that a population of No≥1 offspring is originally derived from crosses between Nf≥2 founder individuals. The founders can be inbred lines with homozygotes at all markers or outbred lines in which markers would be heterozygous. Only 1 chromosome is considered for modeling due to the independence of chromosomes. Let Yo={ymio}m=1…M,i=1…No denote the observed allele read counts at marker m in offspring i. The element ymio consists of 2 values, yref and yalt⁠, which represent the reference read count and the alternative read count, respectively. Similarly, the observed allele read counts at marker m in founder j are represented by Yf={ymjf}m=1…M,j=1…Nf⁠. The matrices for hidden true offspring genotypes and hidden true founder genotypes are represented by Xo={xmio}m=1…M,i=1…No and Xf={xmjf}m=1…M,j=1…Nf⁠, respectively. The element xmio takes a value of 0, 1, or 2 to indicate the reference homozygote, heterozygote, and alternative homozygote genotype without phasing information. Unlike the offspring genotype matrix, the founder genotype xmjf stores the phased genotype as xmjf=(x1,x2)⁠, where x1 and x2 indicate alleles at marker m on one of the diploid chromosomes and another in founder j, respectively. The reference allele is represented by 0, while 1 denotes the alternative. Considering the linkage between the markers and the independence of the founder genotypes, we can assume that the sequence of the descendent genotypes from founders does not follow a Markov process, while the sequence of the descendent haplotypes hio does. To enable our estimation problem to be solved in the HMM framework, Ho={hmio}m=1…M,i=1…No is introduced to represent the matrix of the phased descendent haplotypes. The element hmio denotes a pair of descendent haplotypes (h1,h2) at marker m in offspring i. Therefore, h1 and h2 each take one of the natural numbers that are ≤Nf if all founders are inbred or ≤2Nf if outbred to indicate the origins of the descendent haplotypes.

The algorithm estimates Ho and Xf based on Yo and Yf by maximizing the joint probability P(Ho,Xf,Yo,Yf)⁠. We assume the independence of the offspring and the founders and the independence of each founder genotype at each marker. Therefore, our algorithm tries to maximize the following joint probability:

P(Ho,Xf,Yo,Yf)=∏i=1No{∏m=1MP(ymio|hmio,xmf)∏m=2MP(hmio|hm−1,io,xm−1f)P(h1io|x1f)}×∏m=1MP(ymf|xmf)P(xmf),

where P(ymio|hmio,xmf)⁠, P(hmio|hm−1,io,xm−1f)⁠, and P(h1io|x1f) correspond to the emission probability, the transition probability, and the initial probability of the HMM, respectively. P(ymf|xmf) is the probability of observing read counts for the founders at marker m when the combination of the true genotypes of the founders is xmf⁠. The details to derive the equation shown above are available in the Joint probability derivation section in the Supplementary Methods. We assume that P(xmf) follows a discrete uniform probability for all possible combinations of the founder genotypes while omitting cases in which all founders have a same genotype.

Emission probability

To obtain the emission probability P(ymio|hmio,xmf)⁠, 3 additional probabilities are introduced: P(xmi′o|xmio,emmap)⁠, P(ymio|xmi′o,eseq,wm)⁠, and P(xmio|hmio,xmf)⁠. The first 2 probabilities incorporate the effect of read mismapping that would make true genotype unobservable. P(xmi′o|xmio,emmap) is the probability that the observable genotype is xmi′o when read mismapping occurred at marker m of ith offspring that has the true genotype xmio⁠. P(ymio|xmi′o,eseq,wm) represents the probability to observe the reads ymio when the observable genotype is xmi′o⁠. P(xmio|hmio,xmf) is the probability to observe the genotype xmio when the haplotype hmio was descendent from the founders having the genotype xmf⁠. We then rewrite the emission probability as lmio=P(ymio|hmio,xmf,emmap,eseq,wm)⁠, and it holds that

lmio=∑xmioP(ymio|xmio,emmap,eseq,wm)P(xmio|hmio,xmf),P(ymio|xmio,emmap,eseq,wm)=∑xmi′oP(ymio|xmi′o,eseq,wm)P(xmi′o|xmio,emmap),

where P(xmio|hmio,xmf) equals 1 for the possible genotype under the constraint of the descendent haplotypes and founder genotypes in sample i and 0 for the other genotypes. For example, if sample 1 has h1,1o=(h1=1,h2=4) at marker 1, indicating that the haplotypes descended from the first chromosome of founder 1 and the second chromosome of founder 2, and if the founder genotypes are x1,1f=(0,0) and x1,2f=(0,1)⁠, the genotype can only be heterozygotic . The parameter eseq represents the global sequencing error rate that an unexpected allele read can be observed when the true genotype is either homozygote. We only assume that the sequencing error generates a read representing an opposite allele of the true allele. All markers are considered to have a same eseq value. The parameters wm and emmap are the marker-specific allele read bias and mismapping rate at marker m. These are the key parameters that make GBScleanR differ from the other algorithms including magicImpute in which no allele read bias and a constant mismapping rate are assumed. xmi′o indicates the observable genotype that results from accounting for the mismapping with emmap⁠. We assume that emmap takes 2 values (eref,ealt)⁠. eref indicates the probability of incorrectly observing an alternative allele due to mismapped alternative reads when the true genotype is a reference homozygote. Similarly, ealt represents the probability of incorrectly observing a reference allele due to mismapped reference reads when the true genotype is an alternative homozygote. In the calculation of P(xmi′o|xmio,emmap)⁠, we do not consider the undercalling of heterozygotes and missing calls for which P(ymio|xmi′o,eseq,wm) accounts. Therefore, when the true genotype xmio is heterozygous, we can expect to observe that both alleles and an additional incorrect observation of either allele due to mismapping do not change the observable genotype xmi′o that is heterozygous. On the other hand, for example, the incorrectly observed reference allele due to mismapping of reference reads at an alternative homozygous site makes the site can be observed as heterozygous. Thus, the probability P(xmi′o|xmio,emmap=(eref,ealt)) takes the values listed in Table 1. The observed read counts ymio={yref,yalt} follow binomial distribution

P(ymio|xmi′o=0,eseq,wm)∝(1−eseq)yref(eseq)yalt,P(ymio|xmi′o=1,eseq,wm)∝(1−wm)yref(wm)yalt,P(ymio|xmi′o=2,eseq,wm)∝(eseq)yref(1−eseq)yalt,

given the observable genotype xmi′o⁠. To reduce the negative impact of overrepresented mismapping reads that can make the probabilities of less probable genotypes being 0, we add 0.005 to the probabilities for the 3 possible genotypes if any of xmi′o gave P(ymio|xmi′o,eseq,wm)=0⁠. The normalization constant for P(ymio|xmi′o,eseq,wm) is not shown in the equation above. Similarly, the emission probability for founders can be obtained by

lmf∝∏jNfP(ymjf|xmjf,emmap,eseq,wm),

where the normalization constant for lmf is not shown. To omit the less probable founder genotypes, the probability P(ymjf|xmjf,emmap,eseq,wm) is set to 0 if the result is



【本文地址】


今日新闻


推荐新闻


CopyRight 2018-2019 办公设备维修网 版权所有 豫ICP备15022753号-3